Domain generation algorithms detection with feature extraction and Domain Center construction

Network attacks using Command and Control (C&C) servers have increased significantly. To hide their C&C servers, attackers often use Domain Generation Algorithms (DGA), which automatically generate domain names for C&C servers. Researchers have constructed many unique feature sets and detected DGA domains through machine learning or deep learning models. However, due to the limited features contained in the domain name, the DGA detection results are limited. In order to overcome this problem, the domain name features, the Whois features and the N-gram features are extracted for DGA detection. To obtain the N-gram features, the domain name whitelist and blacklist substring feature sets are constructed. In addition, a deep learning model based on BiLSTM, Attention and CNN is constructed. Additionally, the Domain Center is constructed for fast classification of domain names. Multiple comparative experiment results prove that the proposed model not only gets the best Accuracy, Precision, Recall and F1, but also greatly reduces the detection time.


Introduction
Malware has become the foremost threat to network security. To evade detection by security facilities, malware is becoming increasingly complex. One typical approach is to integrate a Domain Generation Algorithm (DGA) [1] into the software to generate a large number of rapidly changing domain names. As a backup or main means of communication with the Command and Control (C&C) server, this method effectively increases the robustness of a botnet [2], allowing attackers to retain control of infected hosts. Correspondingly, research on DGAs has become a hot topic in network security. However, because DGA domains are updated quickly, existing methods produce too many false positives in practical use. Therefore, the detection of DGA domains remains an arduous task in the computer security field.
Discovering DGA domains is very important for maintaining network security. The existing solutions mainly include static blacklists [3], reverse engineering [4], machine learning [5] and deep learning [6]. Because static blacklists are updated slowly while DGA domains change quickly, it is difficult to apply static blacklists effectively to DGA detection. Reverse engineering requires a malware sample, which is not always available. In recent years, machine learning and deep learning methods have provided new hope for DGA detection. These methods construct a feature set of domain names and feed it to machine learning or deep learning models to detect DGA domains. Since deep learning models have stronger nonlinear modeling ability than traditional machine learning models, they can detect DGA domains more accurately. Although there have been some studies on detecting DGA domains through deep learning models [7][8][9], most of them construct the feature set from the domain name only. Due to the limited effective information contained in the domain name, DGA domains cannot be accurately detected. To overcome these shortcomings, a feature set with rich features is constructed. Because the Whois [10] record of a domain name contains rich features (e.g. registrar and registration time) related to the domain name category, the Whois features are extracted to construct the feature set. Because the N-gram [11] features of the domain name also contain rich information that reflects whether the domain name is malicious, the blacklist and whitelist substring datasets of the domain name are constructed to obtain the N-gram features.
Owing to its strength in sequential modeling, the Recurrent Neural Network (RNN) [12] is widely used in DGA detection among the many types of deep learning models. Although RNN has made some achievements in DGA detection, it still has defects. Firstly, when the sequence is too long, the vanishing gradient problem inevitably occurs in RNN. Secondly, RNN assigns the same weight to all features. Finally, the high dimensionality of RNN makes the model difficult to converge. For deep neural networks, it has been reported that a more complex network structure tends to give better results [13]. Therefore, Bi-Directional Long Short-Term Memory (BiLSTM), Attention and Convolutional Neural Network (CNN) are used to construct the DGA detection model. BiLSTM [14] is used to address the vanishing gradient problem. The Attention mechanism [15] is used to assign different weights to different features. CNN [16] is used to reduce the high dimensionality. In addition, skip connect [17] is used at the output of the Attention network to address the vanishing gradient and weight matrix degradation problems.
Although the classification results of the deep learning model improve as the number of layers increases, the time spent also increases [13]. Therefore, in order to reduce the DGA detection time on the validation set, the Domain Center is constructed. When a new domain name is input, its feature vector is first obtained, and then its hidden vector is obtained by the constructed deep learning model. Finally, the Euclidean distances [18] between the hidden vector and the mean vectors stored in the Domain Center are calculated to obtain the final classification results.
The main innovations of this work are as follows: 3. The Domain Center is proposed to reduce the DGA detection time on the validation set.
The remainder of this work is organized as follows. Section 2 introduces the latest research results of DGA detection; Section 3 introduces the background of BiLSTM, the Attention mechanism and CNN; Section 4 introduces the construction method of the feature set; Section 5 introduces the structure of the deep learning model constructed in this paper; Section 6 introduces the dataset selected in this paper and the experimental results; Section 7 provides a final conclusion.

Related work
DGA detection methods include blacklists, reverse engineering, machine learning and deep learning. Although the blacklist method can provide effective security and is used by most network security companies [19,20], its inherent slowness in updating makes it easy for DGA domains to bypass blacklist detection. Reverse engineering requires a malware sample, which is not always available for DGA detection [21]. Therefore, most research on DGA detection focuses on machine learning and deep learning methods. A machine learning method first constructs a DGA domain feature set, and then realizes DGA detection with machine learning models. Tuan et al. [22] proposed a machine learning based DGA detection model using TF-IDF and n-gram for feature representation. The results showed that logistic regression and SVM were the most effective. Štampar et al. [23] engineered a robust feature set, and accordingly trained and evaluated 14 ML, 9 DL, and 2 comparative models on two independent datasets. The experimental results showed that if ML features are properly engineered, there is only a marginal difference in overall score between the top ML and DL representatives. Soleymani et al. [24] applied machine learning algorithms and text mining technology to analyze the DNS protocol and identify DGA botnets. The experimental results showed that Random Forest could be effectively used in DGA botnet detection and had the best DGA botnet detection accuracy. Chin et al. [25] proposed a machine learning framework for identifying and clustering domain names to circumvent threats from DGAs. Li et al. [26] proposed a machine learning framework (a two-level model and a prediction model) for identifying and detecting DGA domains to alleviate the threat they pose. Baruch et al. [27] surveyed different machine learning methods for detecting DGAs by analyzing only the alphanumeric characteristics of the domain names in the network.
The deep learning method also needs to construct the feature set of DGA domains, and then realizes DGA detection through the deep learning models. Tuan et al. [28] proposed solutions for detecting and classifying DGA families. They proposed two deep learning models called LA_Bin07 and LA_Mul07 by combining the LSTM network and Attention layer. The experimental results showed that the LA_Bin07 and LA_Mul07 models solved the DGA botnets problem for binary and multiclass classification problems with very high accuracy. Namgung et al. [29] proposed an efficient DGA detection method based on BiLSTM, which further maximized the detection performance by using the CNN + BiLSTM integrated model, and allowed the model to learn local and global information at the same time in the domain sequence. The experimental results showed that the existing CNN and LSTM models had obtained F1 scores of 0.9384 and 0.9597 respectively, while the proposed BiLSTM and integrated model had obtained F1 scores of 0.9618 and 0.9666 respectively. Liang et al. [30] proposed three feature extraction methods adapted to the length of the DGA domains. In addition, they further analyzed the public suffix to evaluate its impact on the detection of DGA domains. The experimental results showed that the method greatly improved the detection performance. Lison et al. [31] demonstrated that a deep learning approach based on RNN was able to detect domain names generated by DGAs with high precision. Xu et al. [32] combined n-gram and a deep CNN to propose a novel n-gram combined character based domain classification (N-CBDC) model. Experiments on real-world data showed that N-CBDC could effectively detect DGAs. Ren et al. [33] proposed a deep learning framework for identifying and detecting DGA domains.
It can be seen from previous research that the primary task of both machine learning and deep learning methods is to construct the DGA domain feature set, which plays a key role in DGA detection. Since existing studies mostly construct the feature set from the domain name only, which contains limited features, the domain name features, Whois features and N-gram features are combined here to construct a feature set containing rich features.

LSTM
LSTM [34] is a variant of RNN [12], and the structure of LSTM is shown in Fig 1. LSTM includes an input gate $i_t$, a forget gate $f_t$ and an output gate $o_t$. The forget gate $f_t$ decides which parts of the previous cell state $c_{t-1}$ to keep and which to forget. With $h_{t-1}$ denoting the output of the previous step, it is calculated as follows [34]:
$$f_t = \sigma(W_f x_t + U_f h_{t-1} + b_f)$$

PLOS ONE
where $x_t$ is the current input, $\sigma(\cdot)$ is the element-wise sigmoid function, $W_f$ and $U_f$ are the weight matrices and $b_f$ is the bias term. The input gate $i_t$ determines which information is recorded into the cell state, and the cell state $c_t$ is obtained by merging $i_t$ and the new memory $\tilde{c}_t$. The formulas of the input gate are as follows [34]:
$$i_t = \sigma(W_i x_t + U_i h_{t-1} + b_i)$$
$$\tilde{c}_t = \tanh(W_c x_t + U_c h_{t-1} + b_c)$$
$$c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t$$
where $W_i$, $W_c$, $U_i$ and $U_c$ are the weight matrices, $b_i$ and $b_c$ are the bias terms, $\odot$ is the element-wise multiplication and $\tanh(\cdot)$ is the hyperbolic tangent activation function. The output gate $o_t$ determines the output value based on the cell state $c_t$. A sigmoid function is first used to determine which part of $c_t$ needs to be output, then $c_t$ is processed through the $\tanh(\cdot)$ layer, and finally $o_t$ and $\tanh(c_t)$ are multiplied to get the final output $h_t$, which can be denoted by [34]:
$$o_t = \sigma(W_o x_t + U_o h_{t-1} + b_o)$$
$$h_t = o_t \odot \tanh(c_t)$$
where $W_o$ and $U_o$ are the weight matrices and $b_o$ is the bias term.
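As an illustration of how the three gates interact, one LSTM time step can be sketched in plain Python. This is a minimal sketch of the standard LSTM cell, not the implementation used in the experiments; the function names, the parameter dictionary `p` and the toy sizes are assumptions introduced here.

```python
import math

def sigmoid(z):
    """Logistic sigmoid for a single float."""
    return 1.0 / (1.0 + math.exp(-z))

def affine(W, x, U, h, b):
    """Return W x + U h + b for small dense matrices stored as lists of rows."""
    return [
        sum(w * xi for w, xi in zip(w_row, x))
        + sum(u * hi for u, hi in zip(u_row, h))
        + bias
        for w_row, u_row, bias in zip(W, U, b)
    ]

def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM time step; p maps hypothetical names such as 'W_f' to parameters."""
    f_t = [sigmoid(z) for z in affine(p["W_f"], x_t, p["U_f"], h_prev, p["b_f"])]
    i_t = [sigmoid(z) for z in affine(p["W_i"], x_t, p["U_i"], h_prev, p["b_i"])]
    c_new = [math.tanh(z) for z in affine(p["W_c"], x_t, p["U_c"], h_prev, p["b_c"])]
    o_t = [sigmoid(z) for z in affine(p["W_o"], x_t, p["U_o"], h_prev, p["b_o"])]
    # Forget part of the old cell state and add the gated new memory.
    c_t = [f * c + i * g for f, c, i, g in zip(f_t, c_prev, i_t, c_new)]
    # The output gate selects which part of tanh(c_t) is exposed.
    h_t = [o * math.tanh(c) for o, c in zip(o_t, c_t)]
    return h_t, c_t

# Toy usage: input size 2, hidden size 1, all parameters set to zero.
params = {}
for gate in "fico":
    params["W_" + gate] = [[0.0, 0.0]]
    params["U_" + gate] = [[0.0]]
    params["b_" + gate] = [0.0]
h_t, c_t = lstm_step([1.0, -1.0], [0.0], [1.0], params)
```

With all parameters at zero every gate equals 0.5, so half of the previous cell state is kept and no new memory is added, which makes the role of the forget gate easy to see.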

Attention mechanism
As the length of input sentences increases, the ability of LSTM to remember connections between words that are far apart in a sentence decreases. The Attention mechanism [15] addresses this problem by considering all input words to create a context vector and assigning relative weights to them. The structure of the Attention mechanism is shown in Fig 2. In Fig 2, $x = (x_1, x_2, \ldots, x_T)$ represents the input of LSTM, and $h = (h_1, h_2, \ldots, h_T)$ represents the output of the hidden layer of LSTM. The correlation $e_{tj}$ between the $j$th input $h_j$ and the current hidden state $s_{t-1}$ is calculated as follows [15]:
$$e_{tj} = \mathrm{score}(s_{t-1}, h_j)$$
where score() is a correlation operator, and the weighted dot product is chosen in this paper. A softmax transformation is performed on $e_{tj}$ to obtain the corresponding probability $a_{tj}$, which is calculated as follows [15]:
$$a_{tj} = \frac{\exp(e_{tj})}{\sum_{k=1}^{T} \exp(e_{tk})}$$
The context vector $c_t$ of time step $t$ is obtained as the sum of the $h_j$ weighted by $a_{tj}$ as follows [15]:
$$c_t = \sum_{j=1}^{T} a_{tj} h_j$$
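The scoring, softmax and weighted-sum steps can be sketched as follows. This is a simplified illustration: it uses a plain (unweighted) dot product as the score, whereas the paper uses a weighted dot product whose weight matrix is not given here, and the function names are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_context(s_prev, hidden_states):
    """Return (weights, context) for one step of dot-product attention."""
    # Score each hidden state h_j against the current state s_{t-1}.
    scores = [sum(a * b for a, b in zip(s_prev, h_j)) for h_j in hidden_states]
    weights = softmax(scores)
    dim = len(hidden_states[0])
    # The context vector is the attention-weighted sum of the hidden states.
    context = [
        sum(w * h_j[k] for w, h_j in zip(weights, hidden_states))
        for k in range(dim)
    ]
    return weights, context

# Toy usage: a zero query scores every hidden state equally.
weights, context = attention_context([0.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])
```

In the toy call both hidden states receive the same weight, so the context vector is simply their average; a non-zero query would shift the weights toward the hidden states it is most similar to.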

CNN
CNNs [16] capture local information and reduce dimensionality through one-dimensional (1D) convolution and pooling operations. The structure of the CNN for text classification is shown in Fig 3. In the convolution layer, features are extracted with the help of various filters. Between the convolution layer and the pooling layer, rectified linear unit (ReLU) activation functions are applied to make the features nonlinear. In the pooling layer, the feature maps are reduced in size, which reduces the computational effort of subsequent layers and represents important features more efficiently.
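The convolution, activation and pooling stages can be illustrated on a short numeric sequence. The sketch below is a minimal single-filter example with hypothetical function names and toy values; it is not the network configuration used in the experiments.

```python
def conv1d(seq, kernel, bias=0.0):
    """Slide the kernel over seq (no padding, stride 1)."""
    k = len(kernel)
    return [
        sum(seq[i + j] * kernel[j] for j in range(k)) + bias
        for i in range(len(seq) - k + 1)
    ]

def relu(values):
    """Rectified linear unit applied element-wise."""
    return [max(0.0, v) for v in values]

def max_pool1d(values, size):
    """Non-overlapping max pooling; a trailing partial window is dropped."""
    return [
        max(values[i:i + size])
        for i in range(0, len(values) - size + 1, size)
    ]

# Toy usage: six inputs -> five convolution outputs -> two pooled values.
feature_map = relu(conv1d([1.0, 3.0, 2.0, 5.0, 4.0, 6.0], [1.0, 1.0]))
pooled = max_pool1d(feature_map, 2)
```

The pooled output is shorter than the feature map, which is the dimensionality reduction referred to above.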

website or the specific purpose of the domain name. Therefore, the domain name features shown in Table 1 are obtained.

Whois feature set
The Whois domain name database stores the information of all registered domain names. Whois information can be used to check the availability of domain names, identify trademark infringement, and hold domain name registrants accountable. Rich information such as registrant, registration time, DNS and so on can be obtained through Whois. As shown in Table 2, the Whois features are obtained and converted into integer data.

N-gram feature set
4.3.1 Domain name whitelist substring N-gram feature set. The Alexa ranking represents the world ranking of website popularity. Alexa top 1 million stores the top 1 million websites in order of popularity, so the top ranked websites in Alexa top 1 million are usually the ones with higher credibility. Therefore, the top 100,000 domain names of Alexa top 1 million are selected to build the domain name whitelist substring feature set (DNWSFS), and the number of substrings of each domain name to be tested that appear in the DNWSFS is obtained. The specific process is as follows.
Step one, remove the special characters of the top 100,000 domain names of Alexa top 1 million, and split them into substrings through the N-gram method. N-gram slides from the left to the right of the domain name using a sliding window of length N. Taking "domainname" as an example, the process of 4-gram is shown in Fig 5. According to the empirical value, the values of N are set to 3-8, and the 3-gram to 8-gram substrings of the top 100,000 domain names in Alexa top 1 million are obtained respectively.
Step two, count the number of occurrences of each substring in step one, so as to build the DNWSFS. DNWSFS contains all the substrings and the number of times they occur.
Step three, remove the special characters of each domain name to be tested, and obtain its 3-gram to 8-gram substrings.
Step four, query the number of occurrences in DNWSFS of the 3-gram to 8-gram substrings of the domain name to be tested.
According to the above process, the domain name whitelist substring N-gram features are obtained as shown in Table 3.  Table 4.
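The four steps above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions: "special characters" are taken to mean every character other than letters and digits, the per-N feature is taken to be the summed occurrence count of the tested domain's substrings, and the function names and the one-entry whitelist are hypothetical stand-ins for the 100,000 Alexa domain names.

```python
import re
from collections import Counter

def clean(domain):
    """Lower-case the name and drop every character that is not a letter or digit."""
    return re.sub(r"[^a-z0-9]", "", domain.lower())

def ngrams(text, n):
    """All substrings of length n, sliding left to right; empty if text is shorter."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def build_substring_set(domains, n_values=range(3, 9)):
    """Steps one and two: count every 3-gram to 8-gram substring of the list."""
    counts = Counter()
    for domain in domains:
        cleaned = clean(domain)
        for n in n_values:
            counts.update(ngrams(cleaned, n))
    return counts

def ngram_features(domain, substring_set, n_values=range(3, 9)):
    """Steps three and four: look up the tested domain's substrings per N."""
    cleaned = clean(domain)
    return {n: sum(substring_set[g] for g in ngrams(cleaned, n)) for n in n_values}

# Toy usage with a one-entry whitelist.
dnwsfs = build_substring_set(["google.com"])
features = ngram_features("goo.gl", dnwsfs)
four_grams = ngrams("domainname", 4)
```

For "domainname" the 4-gram window yields seven substrings, from "doma" to "name". Because `Counter` returns 0 for unseen keys, substrings absent from the whitelist contribute nothing to the feature value.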

Method
Publicly available data are used in this article, and all data can be accessed by everyone. The domain names of Alexa top 1 million can be downloaded from: https://www.alexa.com/. The DGA domain names can be downloaded from: https://data.netlab.360.com/.

A DGA detection model based on feature extraction and Domain Center construction, FEDCC, is constructed in this paper, and the structure of the model is shown in Fig 6. Firstly, as described in Section 4, the domain name features, Whois features and N-gram features of the input layer are extracted to form the feature vector of a domain name. Secondly, the feature vector is input into the BiLSTM network to obtain the hidden vector. Thirdly, the Attention network is used to assign different degrees of attention to the hidden vector. Fourthly, the feature vector and the hidden vector output by the Attention network are added as the result of the skip connect network. Fifthly, the result of the skip connect network is input into the CNN network. 1D convolution is used to further extract hidden relationships, and max pooling is used to reduce the dimension. Sixthly, the output of the CNN network is input into the fully connected network, and the final classification result is obtained through the softmax function. Finally, the hidden vectors of all samples output by the CNN network are input to the Domain Center, and the mean vectors of the different categories of samples are obtained. When a new domain name is input, it is only necessary to calculate the Euclidean distances between the hidden vector of the domain name obtained by the deep learning model and the mean vectors stored in the Domain Center to quickly achieve classification. The specific process is as follows.

BiLSTM network
Unlike LSTM [34], which can only use the information before the current time node, BiLSTM [14] can use the forward and backward timing characteristics at the same time. As described in Section 4, the feature vector $V = \{v_1, v_2, \ldots, v_n\}$ with length $n$ is obtained. Then, V is input into the forward and backward networks of BiLSTM to obtain the forward and backward hidden vectors. Finally, the forward and backward hidden vectors are combined to obtain the bidirectional hidden vector h.

Attention network
The Attention mechanism assigns different attentions to different features. Specifically, h is first input into the Attention network to obtain the hidden representation x, where W represents the weight and b represents the bias term. Secondly, the attentions of the features are calculated according to the similarity between y and x, where y is a randomly initialized feature vector. After the attentions are obtained, the softmax function is used for normalization to obtain the weight vector r. Finally, the comment vector f containing the attention of all features is obtained through the weighted sum of r.

Skip connect network
As the network deepens, the objective function is increasingly likely to fall into local optimal solutions, while the problems of weight matrix degradation and vanishing gradients become more serious. Since the skip connect network [35] directly takes the input data as part of the output, it can alleviate the above problems well. V and f are added to obtain the result $V_{sk}$ of skip connect:
$$V_{sk} = V + f$$

CNN network
The CNN network includes 1D convolution and max pooling. First, $V_{sk}$ is input into the 1D convolution network to further extract hidden relationships in the hidden vector and obtain t, where $W_{cnn}$ represents the weight and $b_{cnn}$ represents the bias term. Then, t is input into the max pooling network to obtain the pooling result g.

Output
Input g into the fully connected network to obtain the final classification label through the softmax function.

Domain Center
The original DGA detection method is as follows. The distance between the hidden vector $y = (y_1, y_2, \ldots, y_n)$ of the domain name to be tested and the hidden vector $x_i = (x_{i1}, x_{i2}, \ldots, x_{in})$ of each of the $l$ labeled samples is calculated:
$$Test_i = \sqrt{(x_{i1} - y_1)^2 + (x_{i2} - y_2)^2 + \cdots + (x_{in} - y_n)^2}, \quad i = 1, 2, \ldots, l$$
It can be seen that the time complexity of traditional DGA detection is $O(l \times n)$. The Domain Center is proposed to reduce detection time, and the processes of the Domain Center are as follows. The mean vector of the hidden vectors of each category is calculated and stored, with $x_{ki}$ denoting the hidden vector of the $i$th sample of category $k$:
$$b = \frac{1}{l_b} \sum_{i=1}^{l_b} x_{0i}, \quad X_{label} = 0$$
$$d_1 = \frac{1}{l_1} \sum_{i=1}^{l_1} x_{1i}, \quad X_{label} = 1$$
$$d_2 = \frac{1}{l_2} \sum_{i=1}^{l_2} x_{2i}, \quad X_{label} = 2$$
$$\cdots$$
$$d_c = \frac{1}{l_c} \sum_{i=1}^{l_c} x_{ci}, \quad X_{label} = c$$
where $l_b$ denotes the number of benign domains, $X_{label} = 0$ denotes the benign domains, $l_i$ denotes the number of DGA domains labeled $i$, and $i = 1, 2, \ldots, c$ indicates the class of the DGA. The improved similarity is calculated as follows:
$$Test_B = \sqrt{(b_1 - y_1)^2 + (b_2 - y_2)^2 + \cdots + (b_n - y_n)^2}$$
$$Test_{T_1} = \sqrt{(d_{11} - y_1)^2 + (d_{12} - y_2)^2 + \cdots + (d_{1n} - y_n)^2}$$
$$Test_{T_2} = \sqrt{(d_{21} - y_1)^2 + (d_{22} - y_2)^2 + \cdots + (d_{2n} - y_n)^2}$$
$$\cdots$$
$$Test_{T_c} = \sqrt{(d_{c1} - y_1)^2 + (d_{c2} - y_2)^2 + \cdots + (d_{cn} - y_n)^2}$$
$$y_{label} = x_{label}\left(\min(Test_B, Test_{T_1}, Test_{T_2}, \ldots, Test_{T_c})\right)$$
That is, the domain name to be tested is assigned the label of the closest mean vector. It can be seen that the time complexity of the Domain Center is $O(c + 1)$, which will greatly reduce the time of domain name classification.
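The Domain Center procedure amounts to a nearest-mean classifier over the hidden vectors. The sketch below illustrates the idea in plain Python; the function names and the two-dimensional toy vectors are assumptions, and in practice the hidden vectors would come from the trained deep learning model.

```python
import math

def mean_vector(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]

def build_domain_center(hidden_vectors, labels):
    """Store one mean vector per label (0 = benign, 1..c = DGA classes)."""
    groups = {}
    for vector, label in zip(hidden_vectors, labels):
        groups.setdefault(label, []).append(vector)
    return {label: mean_vector(vectors) for label, vectors in groups.items()}

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def classify(y, center):
    """Return the label whose stored mean vector is closest to y."""
    return min(center, key=lambda label: euclidean(center[label], y))

# Toy usage: two benign and two DGA hidden vectors.
center = build_domain_center(
    [[0.0, 0.0], [0.0, 2.0], [10.0, 10.0], [12.0, 10.0]],
    [0, 0, 1, 1],
)
label_near_benign = classify([1.0, 1.0], center)
label_near_dga = classify([9.0, 9.0], center)
```

Classification of a new domain name needs only one distance computation per stored mean vector, that is c + 1 distances, instead of one per labeled sample.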

Dataset
DGA domains: 58,000 domain names are selected from 360netlab, excluding the 100,000 domain names used to build DNBSFS. The dataset contains all 58 DGA families, such as tordwm, dircrypt and fobber, and each DGA family contains 1,000 domain names.
Benign domain names: 58,000 domain names are selected from Alexa top 1 million as benign domain names, excluding the 100,000 domain names used to build DNWSFS.

Baseline methods
WSDL [7]: This model proposes a set of heuristic algorithms that automatically label the domain names monitored in real traffic through a weakly supervised deep learning algorithm.
HDNN [8]: This model adopts an improved parallel CNN architecture with multi-scale convolution kernels to extract multi-scale local features from domain names. The framework also includes a BiLSTM architecture based on self-attention, which can extract bi-directional global features from domain names with the Attention mechanism.
DNSML [36]:
The above baseline methods are reproduced according to the references, and the optimal results of each baseline method are obtained through parameter tuning.

Experimental environment
The model in this paper includes two BiLSTM layers and three CNN layers. The numbers of neurons of the two BiLSTM layers are 128 and 256, respectively, and a dropout of 0.3 is used at the end of the second layer. Each CNN layer consists of a convolutional layer and a max pooling layer, and a dropout of 0.3 is used after each pooling layer. The sizes of the convolution kernels of the three layers are all 3 × 3, and the numbers of kernels are 128, 64 and 64, respectively. The sizes of the pooling templates are all 3. The number of epochs is 150, the batch size is 32, and the learning rate is 0.00001.

Evaluation criteria
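Accuracy, Precision, Recall and F1 are used as evaluation criteria, which are calculated as follows.
$$Accuracy = \frac{TP + TN}{P + N}$$
$$Precision = \frac{TP}{TP + FP}$$
$$Recall = \frac{TP}{TP + FN}$$
$$F1 = \frac{2 \times Precision \times Recall}{Precision + Recall}$$
where P and N respectively represent the number of DGA and benign domains, TP and TN respectively represent the number of correctly predicted DGA and benign domains, FP represents the number of benign domains incorrectly predicted as DGA, and FN represents the number of DGA domains incorrectly predicted as benign.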
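The four criteria follow directly from the confusion-matrix counts. The short sketch below uses the standard definitions with DGA as the positive class; the function names and the example counts are illustrative only and are not taken from the experiments.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

def binary_metrics(tp, tn, fp, fn):
    """Return (accuracy, precision, recall, f1) with DGA as the positive class."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return accuracy, precision, recall, f1_score(precision, recall)

# Toy counts: 100 DGA and 100 benign domains.
accuracy, precision, recall, f1 = binary_metrics(tp=90, tn=80, fp=20, fn=10)

# Consistency check on the reported FEDCC Precision and Recall.
reported_f1 = round(f1_score(0.9627, 0.9765), 4)
```

Applying `f1_score` to the reported Precision of 0.9627 and Recall of 0.9765 gives 0.9696, which agrees with the reported F1 of FEDCC.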

Experimental results
FEDCC and the baseline methods are applied to the DGA and benign domain name datasets, and the DGA detection results are shown in Table 5. It can be seen from Table 5 that FEDCC obtains the best classification Accuracy, Precision, Recall and F1, which are 0.9713, 0.9627, 0.9765 and 0.9696, respectively. Among the baseline methods, the Accuracy, Precision, Recall and F1 of HAGD, LA-BM07, ATT-CNN-BiLSTM and HDNN are greater than 0.9, those of CNN-BiLSTM, DGA-RNN and N-CBDC are between 0.8 and 0.9, and those of the remaining methods are less than 0.8. DNSML obtains the worst Accuracy, Precision, Recall and F1, which are 0.2394, 0.2368, 0.2893 and 0.2636 lower than those of FEDCC, respectively. HAGD obtains the best Accuracy, Precision, Recall and F1 among the baselines, which are 0.0191, 0.0166, 0.0264 and 0.0215 lower than those of FEDCC, respectively. In terms of classification time, FEDCC obtains the shortest classification time, which is 1.3 s. Among the baseline methods, ATT-CNN-BiLSTM obtains the longest classification time, which is 113.2 s, about 87 times that of FEDCC. DNSML obtains the shortest classification time among the baselines, which is 15.7 s, about 12 times that of FEDCC.
Although most baseline methods use deep learning models as classifiers, and some baseline methods combine a variety of deep learning models to build complex neural networks, FEDCC still greatly improves the DGA detection results. The reasons are analyzed as follows. WSDL, DNSML and DBD use simple deep learning models with fewer features, and obtain the worst Accuracy, Precision, Recall and F1. Although CNN-BiLSTM, DGA-RNN and N-CBDC jointly use multiple deep learning models, the features they extract are still limited, so their Accuracy, Precision, Recall and F1 are not ideal. HDNN, ATT-CNN-BiLSTM, LA-BM07 and HAGD use more complex deep learning models but features that are not rich enough. Although their Accuracy, Precision, Recall and F1 are greater than 0.9, they are still lower than those of FEDCC. In addition to the domain name features, FEDCC obtains not only the Whois features but also the N-gram features by constructing the DNWSFS and DNBSFS. Through rich feature acquisition and complex deep learning model design, FEDCC greatly improves the Accuracy, Precision, Recall and F1. In addition, the time complexity is reduced from O(l × n) to O(c + 1) by constructing the Domain Center, which greatly reduces the classification time.
6.6 Model analysis
6.6.1 Receiver Operating Characteristic (ROC) curves. The DGA detection dataset constructed in Section 6.1 is a balanced binary classification dataset. Therefore, in order to better interpret the DGA detection results of FEDCC and the baseline methods, the ROC curves of the detection results of each compared model are drawn in Fig 9.

where $ACC_{FEDCC}$ represents the DGA detection Accuracy of FEDCC, and $ACC$ represents the DGA detection Accuracy after changing each component. The results are shown in Table 6. It can be seen from Table 6 that when any component of FEDCC is removed, the detection Accuracy is reduced, and the removal of a feature set reduces the Accuracy of the model to a greater extent than the removal of the Domain Center or a component of the deep learning model. By analyzing the three feature sets, it can be seen that the Whois features have the greatest impact on the model, followed by the N-gram features and the domain name features. By analyzing the four components of the deep learning model, it can be seen that BiLSTM has the greatest impact on the model, followed by Attention, skip connect and CNN. Although the Domain Center is less important for classification accuracy than the feature sets, classification time increases substantially when it is removed. Therefore, the Domain Center can greatly reduce classification time while improving classification accuracy.
6.6.4 Importance analysis of different features. Domain name features, Whois features and N-gram features are obtained to construct a feature vector with a length of 35. The above experiments show that the construction of the feature vector plays a great role in improving the DGA detection results. In order to further study the influence of different features on the detection results, the importance of different features is analyzed through the following experiments.
Following the official LightGBM documentation, the importances of the 35 features are calculated, where $c_1$ and $c_2$ are the numbers of objects in each leaf and $v_1$ and $v_2$ are the formula values in the left and right leaves, respectively. The importance results of the 35 features are shown in Fig 14. It can be seen from Fig 14 that most of the Whois features are more important than the domain name features and the N-gram features, which further verifies the experimental results in Section 6.6.3. The reason is that Whois contains entity information such as the domain name registrar and server, and the extraction of this information greatly improves the DGA detection results.
6.6.5 Analysis of the Domain Center. The Domain Center is proposed to quickly detect whether a newly entered domain name is a DGA domain through the Euclidean distance [18]. Since there are many methods to calculate vector similarity, the following experiments are carried out to explain why the Euclidean distance is used in this paper. Besides the Euclidean distance, the Manhattan distance [38], cosine similarity [39], Minkowski distance [40] and Chebyshev distance [41] are also selected, and the experimental results of the different similarity methods are shown in Table 7. It can be seen that although the Manhattan distance uses less time and the Chebyshev distance achieves the highest Precision, the Euclidean distance achieves the best overall results. Therefore, the Euclidean distance is selected to quickly classify DGA domains. The mean vectors of the hidden vectors of the datasets in Section 6.1 are stored in the Domain Center. A difference in the number of samples in the training set causes a difference in the mean vectors stored in the Domain Center, which in turn affects the DGA detection results. The following experiments verify the relationship between the DGA detection results and the number of samples in the training set, and the experimental results are shown in Table 8.
It can be seen that as the number of samples in each category increases, the DGA detection results gradually improve. When the number of samples in each category exceeds 1,000, the improvement in the DGA detection results with additional samples becomes smaller and smaller, while the training time remains proportional to the number of samples. Therefore, the number of samples for each category is set to 1,000 in this paper.
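For reference, the similarity measures compared in Section 6.6.5 can be written compactly as follows. This is an illustrative sketch with hypothetical function names; the order p used for the Minkowski distance in the experiments is not stated, so p = 3 in the usage lines is an assumption.

```python
import math

def minkowski(a, b, p):
    """Minkowski distance of order p (p = 1 is Manhattan, p = 2 is Euclidean)."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)

def euclidean(a, b):
    """Square root of the summed squared differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    """Sum of absolute component differences."""
    return sum(abs(x - y) for x, y in zip(a, b))

def chebyshev(a, b):
    """Largest absolute component difference."""
    return max(abs(x - y) for x, y in zip(a, b))

def cosine_similarity(a, b):
    """Cosine of the angle between a and b; larger means more similar."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy usage on a 3-4-5 triangle.
origin, point = [0.0, 0.0], [3.0, 4.0]
distances = {
    "euclidean": euclidean(origin, point),
    "manhattan": manhattan(origin, point),
    "chebyshev": chebyshev(origin, point),
    "minkowski_p3": minkowski(origin, point, 3),
}
```

Note that cosine similarity is a similarity rather than a distance, so when it replaces the Euclidean distance in the Domain Center the closest mean vector is the one with the largest value, not the smallest.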

Conclusion
Due to the limited features that traditional DGA detection models can extract, their DGA detection results are not ideal. To solve this problem, a rich feature set, including the domain name features, the Whois features and the N-gram features, and a deep learning model based on BiLSTM, Attention and CNN are constructed for DGA detection. In addition, the Domain Center is built to reduce the DGA detection time. Multiple comparative experiments show that the proposed model not only achieves the best Accuracy, Precision, Recall and F1, but also greatly reduces the classification time of newly entered domain names. However, the model built in this paper is based on passively acquired data and cannot actively detect DGA domains. In order to better realize the task of DGA detection and governance, our main task in the future is to develop a model that actively detects DGAs, so as to proactively prevent the harm caused by DGA domains.